{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-a3619358b7078463",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "# Exercise 7) Learning and Planning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-3353a9b5529d1970",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "In this exercise, we will again investigate the inverted pendulum from the `gymnasium` environment. We want to check, which benefits the implementation of planning offers.\n",
    "\n",
    "Please note that the parameter $n$ has a different meaning in the context of planning (number of planning steps per actual step) than in the context of n-step learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-341da976c725d0ba",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from tqdm.notebook import tqdm\n",
    "import random\n",
    "import numpy as np\n",
    "import gymnasium as gym\n",
    "gym.logger.min_level=40\n",
    "plt.style.use('seaborn-v0_8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-92cdcff0f0a476ff",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "We will reuse the discretization routine from the previous exercise:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-be904d3f45f5ff7f",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "d_T = 15\n",
    "d_theta = 15\n",
    "d_omega = 15\n",
    "\n",
    "\n",
    "def discretize_state(states):\n",
    "    limits = [1, 1, 8]\n",
    "    nb_disc_intervals = [d_theta, d_theta, d_omega]\n",
    "\n",
    "    # bring to value range [-1, 1]\n",
    "    norm_states = [state / limit for state, limit in zip(states, limits)]\n",
    "    interval_lengths = [2 / d for d in nb_disc_intervals]\n",
    "    disc_state = [(norm_state + 1) // interval_length for norm_state,\n",
    "                  interval_length in zip(norm_states, interval_lengths)]\n",
    "    disc_state = [(state - 1) if state == d else state for state,\n",
    "                  d in zip(disc_state, nb_disc_intervals)]  # ensure that disc_state < d\n",
    "\n",
    "    return np.array(disc_state)\n",
    "\n",
    "\n",
    "def continualize_action(disc_action):\n",
    "    limit = 2\n",
    "    interval_length = 2 / (d_T-1)\n",
    "    norm_action = disc_action * interval_length\n",
    "    cont_action = (norm_action - 1) * limit\n",
    "    return np.array(cont_action).flatten()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-4846b5fbbddf7cc5",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 1) Dyna-Q"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-785833d7145b35e8",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Write a Dyna-Q algorithm to solve the inverted pendulum. Check the quality of the result for different number of episodes, number of steps per episode and number of planning steps per interaction.\n",
    "\n",
    "Make sure that the total number of learning steps stays the same for different n, such that comparisons are fair:\n",
    "\n",
    "$\\text{episodes} \\cdot \\text{steps} \\cdot (1+n) = \\text{const.}$\n",
    "\n",
    "Interesting metrics for a comparison could be e.g. the execution time (the tqdm loading bar shows execution time of loops, alternatively you can use the time.time() command to get the momentary system time in seconds) and training stability. \n",
    "\n",
    "![](DynaQ_Algo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-46e520741979f3d8",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## Solution 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-bcde5eb71e92fdc2",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "The solution code is given below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-8c85133baab1af6a",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def interact(env, pi, state, deterministic, epsilon):\n",
    "    \"\"\"Interact with the environment to get to the next state.\n",
    "\n",
    "    Args:\n",
    "        env: The environment to interact with\n",
    "        pi: The policy\n",
    "        state: The state before interaction\n",
    "        deterministic: Whether actions are chosen deterministically or eps-greedily\n",
    "        epsilon: The probability for random interaction\n",
    "\n",
    "    Returns:\n",
    "        next_state: The (discretized) next state after interaction\n",
    "        reward: The reward for the current interaction\n",
    "        terminated: If the goal was reached\n",
    "        action: The applied action\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return next_state, reward, terminated, action\n",
    "\n",
    "\n",
    "def learn(action_values, pi, state, action, next_state, reward, gamma, alpha):\n",
    "    \"\"\"Learn from an interaction with the environment. This process is\n",
    "    identical for momentary data and data sampled from the model.\n",
    "\n",
    "    Args:\n",
    "        action_values: The action-value estimates before the update step\n",
    "        pi: The policy before the update step\n",
    "        state: The state before interaction\n",
    "        action: The chosen action\n",
    "        next_state: The state after interaction\n",
    "        reward: The reward for the interaction\n",
    "        gamma: Discount factor\n",
    "        alpha: Step size\n",
    "\n",
    "    Returns:\n",
    "        action_values: The updated action-value estimates\n",
    "        pi: The updated policy\n",
    "    \n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return action_values, pi\n",
    "\n",
    "\n",
    "def learn_planning_from_experience(action_values, pi, model, gamma, alpha):\n",
    "    \"\"\"Sample interactions from the environment and learn from them.\n",
    "\n",
    "    Args:\n",
    "        action_values: The action-value estimates before the update step\n",
    "        pi: The policy before the update step\n",
    "        model: The current model of past experiences\n",
    "        gamma: Discount factor\n",
    "        alpha: Step size\n",
    "\n",
    "    Returns:\n",
    "        action_values: The updated action-value estimates\n",
    "        pi: The updated policy\n",
    "\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return action_values, pi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pendulumDynaQ(alpha, gamma, epsilon, n, nb_episodes, nb_steps):\n",
    "\n",
    "    env = gym.make('Pendulum-v1')\n",
    "    env = env.unwrapped\n",
    "\n",
    "    action_values = np.zeros([d_theta, d_theta, d_omega, d_T])\n",
    "    pi = np.zeros([d_theta, d_theta, d_omega])\n",
    "\n",
    "    model = {}\n",
    "\n",
    "    cumulative_reward_history = [] # we can use this to figure out how well the learning worked\n",
    "    pbar = tqdm(range(nb_episodes))\n",
    "\n",
    "    for _ in pbar:\n",
    "        \n",
    "        rewards = [] # this time, this list is only for monitoring\n",
    "\n",
    "        state, _ = env.reset() # initialize x_0\n",
    "        state = tuple(discretize_state(state).astype(int))\n",
    "\n",
    "        for _ in range(nb_steps):\n",
    "\n",
    "            # YOUR CODE HERE\n",
    "            raise NotImplementedError()\n",
    "        \n",
    "        pbar.set_description(f\"Cumulative_reward {np.sum(rewards):6.0f}, Progress\")\n",
    "        cumulative_reward_history.append(np.sum(rewards))\n",
    "\n",
    "    env.close()    \n",
    "    return cumulative_reward_history, pi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ad72a11ce720c2ca",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Function to evaluate and render the measurement using the gym environment:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-1e3b5eabe632e0b0",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def experiment(pi, nb_steps=300, render=False):\n",
    "# Runs the inverted pendulum experiment using policy pi for nb_steps steps\n",
    "\n",
    "    if render:\n",
    "        env = gym.make('Pendulum-v1', render_mode=\"human\")\n",
    "    else:\n",
    "        env = gym.make('Pendulum-v1')\n",
    "    env = env.unwrapped\n",
    "\n",
    "    state, _ = env.reset() # initialize x_0\n",
    "    state = tuple(discretize_state(state).astype(int))\n",
    "\n",
    "    for k in range(nb_steps):\n",
    "        if render:\n",
    "            env.render() # comment out for faster execution\n",
    "\n",
    "        next_state, reward, terminated, action = interact(\n",
    "            env, pi, state, deterministic=True, epsilon=0\n",
    "        )\n",
    "\n",
    "        state = next_state\n",
    "\n",
    "        if terminated:\n",
    "            break\n",
    "\n",
    "    env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-17792193ff75f321",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "Let's use nb_episodes = 5000, nb_steps = 500, n = 0 as a first try. This is effectively Q learning.\n",
    "\n",
    "\\begin{align}\n",
    "5000 \\cdot 500 \\cdot (0+1) &= 2.5 \\cdot 10^6\n",
    "\\end{align}\n",
    "\n",
    "The resulting policy is satisfactory.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-6c921cba34a87b7e",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# train\n",
    "print(\"Run without planning\")\n",
    "no_planning_history, no_planning_pi = pendulumDynaQ(alpha=0.1, gamma=0.9, epsilon=0.1, n=0, nb_episodes=5000, nb_steps=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-6e95fa1942d32845",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# run and render the experiment\n",
    "experiment(no_planning_pi, nb_steps=300, render=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-236faf8290c72bc3",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "Now let's try $n=9$ with the same nb_steps = 500:\n",
    "\n",
    "\\begin{align}\n",
    "\\text{nb\\_episodes} &= \\frac{2.5 \\cdot 10^6}{500 \\cdot (9+1)} = 500\n",
    "\\end{align}\n",
    "\n",
    "The resulting policy also looks good.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-e97227c315bad07d",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# train\n",
    "print(\"Run with planning\")\n",
    "with_planning_history, with_planning_pi = pendulumDynaQ(alpha=0.1, gamma=0.9, epsilon=0.1, n=9, nb_episodes=500, nb_steps=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-2950e36d13673fb9",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# run and render the experiment\n",
    "experiment(with_planning_pi, nb_steps=300, render=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-f874b3a97814be53",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Now lets compare the cumulative rewards:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-4fc08c4ca8bbd031",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "plt.plot(no_planning_history, label=\"no planning\")\n",
    "plt.plot(with_planning_history, label=\"planning from experience\")\n",
    "plt.xlabel(\"episode\")\n",
    "plt.ylabel(r\"$\\sum R$\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-de1462d7fe9bdbed",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "\n",
    "The cumulative reward over the episodes both seems to have high variance.\n",
    "\n",
    "- Do you have an idea why? Consider the initial state of the environment after resetting.\n",
    "\n",
    "So why should we prefer the planning method?\n",
    "\n",
    "- **Planning leads to the agent interacting less often with the \"real\" environment, such that in the end less interaction time is needed.** To motivate this, imagine that you interact with a real test bench instead of a simulation and that every interaction with the test bench is costly either in terms of time or in terms of money. When compute costs are lower than interaction costs, it makes sense to revisit already gathered data instead of perfoming real interactions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-2810aacf498563db",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## 2) Simulation-based planning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-eb709d007371a52d",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Although it can be useful for small state spaces, building a system model by storing large amounts of state transitions like in task (1) is rarely feasible in engineering. As engineers, we are capable of a more efficient way of system modeling that we can utilize here: differential equations.\n",
    "\n",
    "Using a state-space model allows to efficiently integrate existing pre-knowledge into the Dyna-Q algorithm we already used. To do so, write a class `pendulum_model` that implements a model of the pendulum. This class should work similar to `gymnasium`: it should at least have a `step` and a `reset` method. In the step method, make use of forward Euler integration to simulate the system dynamics. In the reset method, allow to pass an optional initial state to the model, such that we can easily compare model and environment. If no initial state is passed to the `reset` function, a random initial state should be determined.\n",
    "\n",
    "Integrate this model into a Dyna-Q algorithm."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-4f43d13fe3da4001",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Model of the pendulum in differential-equation form for change of the angular frequency $\\omega$ and the angle $\\theta$ depending on the torque $T_\\mathrm{u}$:\n",
    "\n",
    "\\begin{align*}\n",
    "\\dot{\\omega} &= -\\frac{3 g}{2 l} \\text{sin}(\\theta +\\pi) + \\frac{1}{J} T_\\mathrm{u}\n",
    "\\\\\n",
    "\\dot{\\theta} &= \\omega\n",
    "\\end{align*}\n",
    "\n",
    "\n",
    "\n",
    "Parameters (gravity constant $g$, mass $m$, length  $l$ and intertia $J$ of the pendulum):\n",
    "\n",
    "\\begin{align*}\n",
    "g&=10 \\, \\frac{\\text{m}}{\\text{s}^2}\n",
    "\\\\\n",
    "m&=1 \\, \\text{kg}\n",
    "\\\\\n",
    "l&=1 \\, \\text{m}\n",
    "\\\\\n",
    "J&=\\frac{1}{3} m l^2\n",
    "\\end{align*}\n",
    "\n",
    "Forward Euler integration:\n",
    "\n",
    "\\begin{align*}\n",
    "\\dot{x}(k T_S) \\approx \\frac{x[k+1] - x[k]}{T_S}\n",
    "\\end{align*}\n",
    "\n",
    "with sampling time $T_S = 0.05 \\, \\text{s}$\n",
    "\n",
    "Reward function:\n",
    "\n",
    "\\begin{align*}\n",
    "r_{k+1} = -(\\theta^2[k] + 0.1 \\, \\text{s}^2 \\cdot \\omega^2[k] + 0.001 \\frac{1}{(\\text{N}\\text{m})^2} \\cdot T_\\mathrm{u}^2[k])\n",
    "\\end{align*}\n",
    "\n",
    "Limitations of state and action space:\n",
    "\\begin{align*}\n",
    "\\theta &\\in [-\\pi, \\pi]\n",
    "\\\\\n",
    "\\omega &\\in [-8  \\, \\frac{1}{\\text{s}}, 8  \\, \\frac{1}{\\text{s}}]\n",
    "\\\\\n",
    "T_\\mathrm{u} &\\in [-2 \\, \\text{N}\\text{m}, 2 \\, \\text{N}\\text{m}]\n",
    "\\end{align*}\n",
    "\n",
    "And of course input and output space:\n",
    "\\begin{align*}\n",
    "\\text{action}&=T_\\mathrm{u}\n",
    "\\\\\n",
    "\\text{state}&=\n",
    "\\begin{bmatrix}\n",
    "\\text{cos}(\\theta)\\\\\n",
    "\\text{sin}(\\theta)\\\\\n",
    "\\omega\n",
    "\\end{bmatrix}\n",
    "\\end{align*}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-5b01fba780d2909f",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "## Solution 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-5d0c5a3e7a69720d",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "source": [
    "Model-based planning does not necessarily run faster than experience-based planning. However, experience-based planning fails to cover the whole state space especially in the earlier episodes when there are too few experiences. On the other hand, model-based planning can, of course, only be performed if a state-space model with accurate parametrization is available. In order to overcome parametric deviations between state-space model and environment, one could even use the measurements from the actual environment in order to identify the parameters of the model during runtime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-11f8cac7e03b7f81",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "class pendulum_model:\n",
    "    def __init__(self, dt=0.05, m=1, g=10, l=1):\n",
    "        \"\"\"\n",
    "        Initializes the Pendulum Model.\n",
    "\n",
    "        Args:\n",
    "            dt (float): Sampling time in seconds.\n",
    "            m (float): Mass in kilograms.\n",
    "            g (float): Gravity in meters per second squared.\n",
    "            l (float): Length in meters.\n",
    "        \"\"\"\n",
    "\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "\n",
    "    def reset(self, state=None):\n",
    "        \"\"\"\n",
    "        Resets the pendulum's state.\n",
    "\n",
    "        Args:\n",
    "            state (array): Initial state (theta, omega) in radians and radians per second.\n",
    "\n",
    "        Returns:\n",
    "            state (array): Current state after reset.\n",
    "        \"\"\"\n",
    "\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "\n",
    "        return state\n",
    "\n",
    "    def step(self, T_u):\n",
    "        \"\"\"\n",
    "        Takes a step in the simulation.\n",
    "\n",
    "        Args:\n",
    "            T_u (float): Control torque applied.\n",
    "\n",
    "        Returns:\n",
    "            state (array): Current state after the step.\n",
    "            reward (float): Reward associated with the step.\n",
    "        \"\"\"\n",
    "\n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "\n",
    "        return state, reward\n",
    "\n",
    "    def angle_normalize(self, theta):\n",
    "        \"\"\"\n",
    "        Normalizes the angle.\n",
    "\n",
    "        Args:\n",
    "            theta (float): Angle in radians.\n",
    "\n",
    "        Returns:\n",
    "            float: Normalized angle in radians.\n",
    "        \"\"\"\n",
    "\n",
    "        # usage of this helper function is optional\n",
    "        return (((theta+np.pi) % (2*np.pi)) - np.pi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-21f127639f04da34",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "The following cell is for debugging of the `pendulum_model` class. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-1c1d572fbaded77d",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "env = gym.make('Pendulum-v1')\n",
    "# removes a builtin time limit of k_T = 200, we want to set the time limit ourselves\n",
    "env = env.unwrapped\n",
    "\n",
    "model = pendulum_model()\n",
    "\n",
    "state, _ = env.reset()\n",
    "\n",
    "m_state = model.reset(state)  # model is set to state of env\n",
    "\n",
    "nb_episodes = 10000\n",
    "\n",
    "for _ in range(nb_episodes):\n",
    "\n",
    "    action = env.action_space.sample()\n",
    "\n",
    "    state, reward, terminated, _, _ = env.step(action)  # take action on env\n",
    "    m_state, m_reward = model.step(action)  # take the same action on model\n",
    "\n",
    "env.close()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-48d9727d1bcf9cba",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Using $-\\mathrm{sin}(\\theta)$ instead of $\\mathrm{sin}(\\theta +\\pi)$ makes no difference when assuming analytical precision, but due to numeric errors these formulations will still yield different results in numpy, mainly because $\\pi$ is represented with finite (float) precision. In order to yield the same numbers as in `gymnasium`, we will still make use of the (more cumbersome) $\\mathrm{sin}(\\theta +\\pi)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-0a63f527a46b3115",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Write a function for the Dyna-Q algorithm, which uses the model we defined above: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You may reuse interact and learn from the previous task\n",
    "\n",
    "def learn_planning_from_model(action_values, pi, model, gamma, alpha):\n",
    "    \"\"\"Sample interactions from the environment and learn from them.\n",
    "\n",
    "    Args:\n",
    "        action_values: The action-value estimates before the update step\n",
    "        pi: The policy before the update step\n",
    "        model: The current model of the environment\n",
    "        gamma: Discount factor\n",
    "        alpha: Step size\n",
    "\n",
    "    Returns:\n",
    "        action_values: The updated action-value estimates\n",
    "        pi: The updated policy\n",
    "\n",
    "    \"\"\"\n",
    "    # YOUR CODE HERE\n",
    "    raise NotImplementedError()\n",
    "    return action_values, pi"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-ab054c945a614362",
     "locked": false,
     "schema_version": 3,
     "solution": true,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "def pendulumModelDynaQ(alpha, gamma, epsilon, n, nb_episodes, nb_steps, m=1, g=10, l=1):\n",
    "\n",
    "    env = gym.make('Pendulum-v1')  # , render_mode=\"human\"\n",
    "    env = env.unwrapped\n",
    "    model = pendulum_model(m=m, g=g, l=l)\n",
    "\n",
    "    action_values = np.zeros([d_theta, d_theta, d_omega, d_T])\n",
    "    pi = np.zeros([d_theta, d_theta, d_omega])\n",
    "\n",
    "    # we can use this to figure out how well the learning worked\n",
    "    cumulative_reward_history = []\n",
    "\n",
    "    pbar = tqdm(range(nb_episodes), position=0, leave=True)\n",
    "\n",
    "    for _ in pbar:\n",
    "        rewards = [] # this time, this list is only for monitoring\n",
    "        \n",
    "        # YOUR CODE HERE\n",
    "        raise NotImplementedError()\n",
    "\n",
    "        pbar.set_description(\n",
    "            f\"Cumulative_reward {np.sum(rewards):6.0f}, Progress\")\n",
    "\n",
    "        cumulative_reward_history.append(np.sum(rewards))\n",
    "\n",
    "    env.close()\n",
    "\n",
    "    return cumulative_reward_history, pi\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-6f4a8238fb568bf1",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Use the following cell to compare the learing from experience from 1) to the learning using the defined model: (Beware, nb_steps = 10000 can take some time)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-55a20ee55df4b168",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "# train both setups once\n",
    "print(\"Run with planning from experience\")\n",
    "exp_planning_history, exp_planning_pi = pendulumDynaQ(alpha=0.1, gamma=0.9, epsilon=0.1, n=19, nb_episodes=500, nb_steps=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Run with planning from model\")\n",
    "model_planning_history, model_planning_pi = pendulumModelDynaQ(alpha=0.1, gamma=0.9, epsilon=0.1, n=19, nb_episodes=500, nb_steps=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(exp_planning_history, label=\"planning from experience\")\n",
    "plt.plot(model_planning_history, label=\"planning from model\")\n",
    "plt.xlabel(\"episode\")\n",
    "plt.ylabel(r\"$\\sum R$\")\n",
    "plt.legend(loc=\"upper left\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-1a7bfac51c16c83f",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Use the following cell to execute the policy we got using the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-05c7abf525fbaa40",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "outputs": [],
   "source": [
    "experiment(model_planning_pi, nb_steps=300, render=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": false,
     "grade_id": "cell-54c9b97a5d245359",
     "locked": true,
     "schema_version": 3,
     "solution": false,
     "task": false
    }
   },
   "source": [
    "Now we change the model parameters (e.g. $g$, $m$, $l$) so that our model differs from the \"real world\" (the gym environment).\n",
    "What do you observe? You can also change the parameters to other values yourself.\n",
    "\n",
    "The Following learning curve results from a parameter change using $g =20 \\, \\frac{\\text{m}}{\\text{s}^2}, m = 5 \\, \\text{kg and } l = 2 \\, \\text{m}$:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-a6e85462116fce6c",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "wrong_model_history, wrong_model_pi = pendulumModelDynaQ(\n",
    "    alpha=0.1, gamma=0.9, epsilon=0.1, n=19, nb_episodes=500, nb_steps=500, m=5, g=20, l=2\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-27a52b7e54657710",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "plt.plot(exp_planning_history, label=\"planning from experience\")\n",
    "plt.plot(wrong_model_history, label=\"planning from poor model\")\n",
    "plt.xlabel(\"episode\")\n",
    "plt.ylabel(r\"$\\sum R$\")\n",
    "plt.legend(loc=\"upper left\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "nbgrader": {
     "grade": true,
     "grade_id": "cell-2bc64d2fdf9455ef",
     "locked": false,
     "points": 0,
     "schema_version": 3,
     "solution": true,
     "task": false
    },
    "tags": []
   },
   "source": [
    "By changing the parameters our model differs from the \"real world\". Depending on the amount of difference, the learing curve looks worse than the one with the correct values. \n",
    "The experiment result is also depending on the random starting position. \n",
    "Depending on the parameter difference, the experiment can not be executed successfully any more. \n",
    "\n",
    "**In conclusion**: A good model can outperform experience gathered from the environment, but a poor model choice can also hinder the training process."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Create Assignment",
  "kernelspec": {
   "display_name": "RL25",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
